Performance Evaluation of Apriori Algorithm on a Hadoop Cluster

Authors

  • JÁNOS ILLÉS
  • ISTVÁN VAJK
Abstract

Frequent itemset mining is a well-known concept in data science. When frequent itemset mining algorithms are fed large datasets, they quickly become resource hungry because their search space explodes. This problem is even more apparent when we try to use them on Big Data. Recent advances in parallel programming provide good solutions for dealing with large datasets, but they present their own problems when we try to adapt existing data mining algorithms to the new paradigms. The Apriori algorithm is a classic solution for mining frequent itemsets. In this paper, we provide a parallel implementation of the Apriori algorithm for the Hadoop platform. We introduce a method to measure the performance of the distributed algorithm. In our experimental results, we identify choke points in the algorithm and propose resolutions.

Keywords: Hadoop, MapReduce, Apriori algorithm, Frequent itemset mining, Cloud computing
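As a rough illustration of the level-wise search that Apriori performs, the following minimal sketch simulates one map/reduce round per itemset size in plain Python. It is not the authors' Hadoop implementation: the function names (`map_count`, `reduce_counts`, `generate_candidates`, `apriori`), the in-memory list of partitions, and the absolute `min_support` threshold are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, candidates):
    """Map step (simulated): count candidate occurrences within one partition."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def reduce_counts(partial_counts, min_support):
    """Reduce step (simulated): sum partial counts, keep itemsets meeting min_support."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return {itemset: n for itemset, n in total.items() if n >= min_support}


def generate_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (the Apriori property)."""
    items = sorted({item for itemset in frequent for item in itemset})
    return {
        frozenset(combo)
        for combo in combinations(items, k)
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1))
    }


def apriori(partitions, min_support):
    """Level-wise search: one simulated map/reduce round per itemset size."""
    candidates = {frozenset([item]) for part in partitions for t in part for item in t}
    result, k = {}, 1
    while candidates:
        partials = [map_count(part, candidates) for part in partitions]
        frequent = reduce_counts(partials, min_support)
        result.update(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)
    return result
```

Calling `apriori(partitions, min_support)` with a list of partitions, each a list of item sets, returns a dictionary that maps every frequent itemset to its support count. Every round rescans all partitions, which is where the repeated dataset scans of Apriori come from.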


Similar resources

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...


Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

Mining frequent itemsets from massive datasets has always been one of the most important problems in data mining. Apriori is the most popular and simplest algorithm for frequent itemset mining. To enhance the efficiency and scalability of Apriori, a number of algorithms have been proposed that address the design of efficient data structures, minimizing database scans, and parallel and distributed processing....


Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster

Many techniques have been proposed to implement the Apriori algorithm on the MapReduce framework, but only a few have focused on performance improvement. The FPC (Fixed Passes Combined-counting) and DPC (Dynamic Passes Combined-counting) algorithms combine multiple passes of Apriori in a single MapReduce phase to reduce the execution time. In this paper, we propose improved MapReduce based Apriori algor...


Mining Frequent Item Sets Using Map Reduce Paradigm

In text categorization techniques such as text classification or clustering, finding frequent item sets is a familiar method in current research. Although finding frequent item sets with the Apriori algorithm is a widespread method, the later DHP, partitioning, sampling, DIC, Eclat, FP-growth, and H-mine algorithms were shown to perform better than Apriori in standalone systems. In real sce...


An Efficient Implementation of Apriori Algorithm Based on Hadoop-mapreduce Model

Finding frequent itemsets is one of the most important fields of data mining. The Apriori algorithm is the most established algorithm for finding frequent itemsets in a transactional dataset; however, it needs to scan the dataset many times and to generate many candidate itemsets. Unfortunately, when the dataset is huge, both memory use and computational cost can still be very expensive. In ...




Publication date: 2014